High-for-Low and Low-for-High: Efficient Boundary Detection from Deep Object Features and its Applications to High-Level Vision
Most of the current boundary detection systems rely exclusively on low-level
features, such as color and texture. However, perception studies suggest that
humans employ object-level reasoning when judging if a particular pixel is a
boundary. Inspired by this observation, in this work we show how to predict
boundaries by exploiting object-level features from a pretrained
object-classification network. Our method can be viewed as a "High-for-Low"
approach where high-level object features inform the low-level boundary
detection process. Our model achieves state-of-the-art performance on an
established boundary detection benchmark and it is efficient to run.
Additionally, we show that, due to the semantic nature of our boundaries, we
can use them to aid a number of high-level vision tasks. We demonstrate that
our boundaries improve the performance of state-of-the-art methods on semantic
boundary labeling, semantic segmentation, and object proposal generation. This
process can be viewed as a "Low-for-High" scheme, where low-level boundaries
aid high-level vision tasks.
Thus, our contributions include a boundary detection system that is accurate
and efficient, generalizes well to multiple datasets, and improves existing
state-of-the-art high-level vision methods on three distinct tasks.
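As a rough illustration of the "High-for-Low" idea, the sketch below (a PyTorch toy, not the authors' released model) taps multi-level activations from a pretrained object-classification network, upsamples them to image resolution, and maps the concatenated features to a per-pixel boundary probability with a 1x1 convolution. The VGG16 backbone, the tap points, and the classifier head are all illustrative assumptions.

```python
# Minimal sketch: boundary prediction from high-level features of a pretrained
# classification network. NOT the paper's exact architecture; layer choices
# and the 1x1 classifier head are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class HighForLowBoundary(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = vgg16(weights=VGG16_Weights.DEFAULT).features
        # Indices of the ReLU layers whose activations we tap (assumed choice).
        self.tap_layers = {3, 8, 15, 22, 29}
        tap_channels = 64 + 128 + 256 + 512 + 512
        # 1x1 conv maps the concatenated multi-level features to a boundary logit.
        self.classifier = nn.Conv2d(tap_channels, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i in self.tap_layers:
                # Upsample each tapped feature map back to input resolution.
                feats.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                           align_corners=False))
        fused = torch.cat(feats, dim=1)
        return torch.sigmoid(self.classifier(fused))  # per-pixel boundary prob

model = HighForLowBoundary().eval()
with torch.no_grad():
    prob = model(torch.randn(1, 3, 224, 224))
print(prob.shape)  # torch.Size([1, 1, 224, 224])
```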
TALLFormer: Temporal Action Localization with Long-memory Transformer
Most modern approaches in temporal action localization divide this problem
into two parts: (i) short-term feature extraction and (ii) long-range temporal
boundary localization. Due to the high GPU memory cost caused by processing
long untrimmed videos, many methods sacrifice the representational power of the
short-term feature extractor by either freezing the backbone or using a very
small spatial video resolution. This issue becomes even worse with the recent
video transformer models, many of which have quadratic memory complexity. To
address these issues, we propose TALLFormer, a memory-efficient and end-to-end
trainable Temporal Action Localization transformer with Long-term memory. Our
long-term memory mechanism eliminates the need for processing hundreds of
redundant video frames during each training iteration, thus significantly
reducing GPU memory consumption and training time. These efficiency savings
allow us (i) to use a powerful video transformer-based feature extractor
without freezing the backbone or reducing the spatial video resolution, while
(ii) also maintaining long-range temporal boundary localization capability.
With only RGB frames as input and no external action recognition classifier,
TALLFormer outperforms previous state-of-the-art methods by a large margin,
achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The
code will be available at https://github.com/klauscc/TALLFormer.
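The long-term memory mechanism can be pictured roughly as follows. This is a minimal sketch, not the released TALLFormer code: each training iteration runs the expensive backbone on only a sampled subset of a video's clips, reads cached features for the remaining clips from a memory bank, and refreshes the cache with the freshly computed features. The class name, sampling ratio, and feature layout are assumptions.

```python
# Sketch of a long-term feature memory (assumptions throughout): only a
# fraction of clips pass through the backbone per iteration; the rest are
# served from a cache, avoiding redundant frame processing.
import torch

class LongMemoryBank:
    def __init__(self, num_videos, num_clips, feat_dim):
        # Cached clip features for every video in the dataset (assumed layout).
        self.memory = torch.zeros(num_videos, num_clips, feat_dim)

    def gather(self, video_idx, encoder, clips, sample_ratio=0.25):
        num_clips = clips.shape[0]
        k = max(1, int(num_clips * sample_ratio))
        # Encode only a random subset of clips this iteration (gradients flow
        # through these); the rest come from memory at no backbone cost.
        sampled = torch.randperm(num_clips)[:k]
        feats = self.memory[video_idx].clone()
        fresh = encoder(clips[sampled])            # (k, feat_dim), with grad
        feats[sampled] = fresh
        # Refresh the cache with detached fresh features.
        self.memory[video_idx, sampled] = fresh.detach()
        return feats                               # (num_clips, feat_dim)

# Toy usage with a stand-in encoder (a real model would be a video transformer).
encoder = torch.nn.Linear(3 * 8 * 32 * 32, 256)
bank = LongMemoryBank(num_videos=100, num_clips=64, feat_dim=256)
clips = torch.randn(64, 3 * 8 * 32 * 32)  # 64 flattened clips of one video
feats = bank.gather(video_idx=5, encoder=encoder, clips=clips)
print(feats.shape)  # torch.Size([64, 256])
```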
Long Movie Clip Classification with State-Space Video Models
Most modern video recognition models are designed to operate on short video
clips (e.g., 5-10s in length). Because of this, it is challenging to apply such
models to long movie understanding tasks, which typically require sophisticated
long-range temporal reasoning capabilities. The recently introduced video
transformers partially address this issue by using long-range temporal
self-attention. However, due to the quadratic cost of self-attention, such
models are often costly and impractical to use. Instead, we propose ViS4mer, an
efficient long-range video model that combines the strengths of self-attention
and the recently introduced structured state-space sequence (S4) layer. Our
model uses a standard Transformer encoder for short-range spatiotemporal
feature extraction, and a multi-scale temporal S4 decoder for subsequent
long-range temporal reasoning. By progressively reducing the spatiotemporal
feature resolution and channel dimension at each decoder layer, ViS4mer learns
complex long-range spatiotemporal dependencies in a video. Furthermore, ViS4mer
is faster and requires less GPU memory than the
corresponding pure self-attention-based model. Additionally, ViS4mer achieves
state-of-the-art results on long-form movie video classification tasks on the
LVU benchmark. We also show that our approach generalizes successfully to
other domains, achieving competitive results on the Breakfast and COIN
procedural activity datasets. The code will be made publicly available.
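A structural sketch of the multi-scale decoder idea follows; it is not the ViS4mer implementation. The real model uses structured state-space (S4) layers, for which a simple per-channel linear recurrence stands in here, while each decoder stage halves the temporal resolution (average pooling) and the channel dimension (linear projection), as described above. All sizes and module names are illustrative.

```python
# Sketch of a multi-scale state-space decoder. SimpleSSM is a stand-in for a
# real S4 layer: a per-channel linear recurrence x_t = a * x_{t-1} + b * u_t.
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.9))
        self.b = nn.Parameter(torch.ones(dim))

    def forward(self, u):                        # u: (batch, time, dim)
        x = torch.zeros_like(u[:, 0])
        ys = []
        for t in range(u.shape[1]):
            x = self.a * x + self.b * u[:, t]    # recurrent state update
            ys.append(x)
        return torch.stack(ys, dim=1)

class DecoderStage(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ssm = SimpleSSM(dim)
        self.pool = nn.AvgPool1d(kernel_size=2)  # halve the sequence length
        self.proj = nn.Linear(dim, dim // 2)     # halve the channel dimension

    def forward(self, x):                        # x: (B, T, D)
        x = x + self.ssm(self.norm(x))           # residual sequence mixing
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)

# Tokens from a short-range transformer encoder (assumed): 256 steps, 512-dim.
x = torch.randn(2, 256, 512)
for stage in [DecoderStage(512), DecoderStage(256), DecoderStage(128)]:
    x = stage(x)
print(x.shape)  # torch.Size([2, 32, 64])
```

The progressive pooling is what keeps long-range reasoning tractable: each stage processes a shorter, narrower sequence than the last, so the deepest layers see a coarse summary of the whole video rather than every frame token.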